STA 101 - Summer I 2022
Raphael Morsomme
Earlier, we saw that adding the predictor \(\dfrac{1}{\text{displ}}\) gave a better fit.
Let us see if the same idea works with the trees dataset.
Suppose we want to predict Volume using only the variable Girth.
One could argue that there is a slight nonlinear association between the two variables.
R function to compute \(R^2\)
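The function compute_R2() used below is not shown in these notes; a minimal sketch consistent with the definition \(R^2 = 1 - SSR/SST\) might look like this (illustrative; the course's own version may differ, e.g. in rounding):

```r
# A possible implementation of compute_R2() (illustrative sketch):
# R^2 = 1 - SSR/SST, where SSR is the sum of squared residuals
# and SST the total sum of squares of the response.
compute_R2 <- function(model) {
  ssr <- sum(residuals(model)^2)   # sum of squared residuals
  y   <- model.frame(model)[[1]]   # response variable
  sst <- sum((y - mean(y))^2)      # total sum of squares
  round(1 - ssr / sst, 3)
}
```

Note that summary(model)$r.squared reports the same quantity (unrounded).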
We start with the simple model
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{Girth} \]
To capture the nonlinear association between Girth and Volume, we consider the predictor \(\text{Girth}^2\).
\[ \text{Volume} \approx \beta_0 + \beta_1 \text{Girth} + \beta_2 \text{Girth}^2 \]
The R code to fit this model is
d_tree2 <- mutate(d_tree, Girth2 = Girth^2)
m2 <- lm(Volume ~ Girth + Girth2, data = d_tree2)
compute_R2(m2)
[1] 0.962
\(R^2\) has increased! It went from 0.935 to 0.962.
What if we also include the predictors \(\text{Girth}^3, \text{Girth}^4, \dots, \text{Girth}^{k}\) for some larger number \(k\)?
The following R code fits such a model with \(k=29\), that is, \[
\text{Volume} \approx \beta_0 + \beta_1 \text{Girth} + \beta_2 \text{Girth}^2 + \dots + \beta_{29} \text{Girth}^{29}
\]
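The fitting code itself did not survive conversion; one way to write it is the following sketch, using poly() with raw = TRUE to generate all the powers of Girth at once (the name m_extreme matches the later references; d_tree is assumed to be the built-in trees data frame):

```r
# Sketch: fit Volume on Girth, Girth^2, ..., Girth^29.
# With n = 31 trees this model is nearly saturated, so R silently
# drops rank-deficient terms; it is shown only to illustrate overfitting.
d_tree <- datasets::trees
m_extreme <- lm(Volume ~ poly(Girth, degree = 29, raw = TRUE), data = d_tree)
round(summary(m_extreme)$r.squared, 3)
```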
[1] 0.975
\(R^2\) has again increased! It went from 0.962 to 0.975.
In fact, if we keep adding predictors, \(R^2\) will always increase: additional predictors allow the regression curve to be more flexible, hence to lie closer to the points and reduce the residuals.
Is the m_extreme model a good model?
- Does it accurately represent the population?
- Remember that we want to learn about the relation between Volume and Girth present in the population (not the sample).
Girth_new <- seq(min(d_tree$Girth), max(d_tree$Girth), by = 0.01)
new_data <- tibble(Girth = Girth_new)
Volume_pred <- predict(m_extreme, new_data)
new_data <- mutate(new_data, Volume_pred = Volume_pred)
head(new_data)
# A tibble: 6 x 2
Girth Volume_pred
<dbl> <dbl>
1 8.3 10.2
2 8.31 10.3
3 8.32 10.4
4 8.33 10.5
5 8.34 10.6
6 8.35 10.7
The model m_extreme overfits the data.
A model overfits data when it corresponds very closely to the data set and does a poor job for new data.
Remember that we want to learn about the population, not the sample!
We saw that \(R^2\) keeps increasing as we add predictors.
\[ R^2 = 1 - \dfrac{SSR}{SST} \]
\(R^2\) can therefore not be used to identify models that overfit the data.
Instead, we use the adjusted-\(R^2\).
\[ R^2_{\text{adj}} = 1 - \dfrac{SSR}{SST}\dfrac{n-1}{n-p-1} \]
where \(p\) corresponds to the number of predictors.
The ratio \(\dfrac{SSR}{SST}\) favors models that closely fit the data.
The ratio \(\dfrac{n-1}{n-p-1}\) penalizes models with many predictors.
The model with the highest adjusted-\(R^2\) typically hits the sweet spot.
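This formula can be computed by hand. A sketch (the name compute_R2_adj is hypothetical; summary(model)$adj.r.squared reports the same quantity):

```r
# Adjusted R^2 by hand (illustrative sketch):
# R^2_adj = 1 - (SSR/SST) * (n - 1) / (n - p - 1)
compute_R2_adj <- function(model) {
  ssr <- sum(residuals(model)^2)   # sum of squared residuals
  y   <- model.frame(model)[[1]]   # response variable
  sst <- sum((y - mean(y))^2)      # total sum of squares
  n   <- length(y)                 # sample size
  p   <- length(coef(model)) - 1   # number of predictors
  1 - (ssr / sst) * (n - 1) / (n - p - 1)
}
```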
Two popular criteria that balance goodness of fit (small SSR) and parsimony (small \(p\)) are the AIC and the BIC.
The formulas for the AIC and BIC are, respectively, \[ AIC = 2p - \text{GoF}, \qquad BIC = \ln(n)\,p - \text{GoF} \] where \(\text{GoF}\) is a measure of the goodness of fit of the model.
Unlike the adjusted-\(R^2\), smaller is better for the AIC and BIC.
Note that the BIC penalizes the number of parameters \(p\) more strongly (as soon as \(\ln(n) > 2\), that is, \(n \geq 8\)).
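Base R computes both criteria directly; for instance, for the quadratic model m2 fitted earlier:

```r
# AIC and BIC for a fitted lm object; smaller is better.
AIC(m2)
BIC(m2)   # here n = 31, so BIC's ln(n) penalty exceeds AIC's 2
```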
All of these are easily accessible in R with the command glance() from the broom package. For instance, applying it to the three models above (the simple model, m2 and m_extreme, in that order) gives
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.935 0.933 4.25 419. 8.64e-19 1 -87.8 182. 186.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.962 0.959 3.33 350. 1.52e-20 2 -79.7 167. 173.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.975 0.956 3.44 51.4 5.11e-11 13 -73.0 176. 197.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
In this case, all three criteria (adjusted-\(R^2\), AIC and BIC) indicate that m2 is the best model.
The adjusted-\(R^2\), AIC and BIC all try to balance goodness of fit and parsimony.
They achieve this balance by favoring models with small SSR while penalizing models with larger \(p\).
The form of the penalty for \(p\) may seem somewhat arbitrary; the AIC and BIC, for instance, penalize \(p\) differently.
Instead, we could look for the model with the best predictive performance; that is, the model whose predictions for new observations are closest to the true values.
The holdout method is a simple method to evaluate the predictive performance of a model.
Note that the test set consists of new observations for the model.
A good model will have good prediction accuracy in step 3.
The following R function splits a sample into a training and a test set
construct_training_test <- function(sample, prop = 2/3){
  n <- nrow(sample)
  n_training <- round(n * prop)
  sample_random <- slice_sample(sample, n = n)
  sample_training <- slice(sample_random, 1:n_training)
  sample_test <- slice(sample_random, -(1:n_training))
  return(list(
    training = sample_training, test = sample_test
  ))
}
d_car <- ggplot2::mpg
training_test_sets <- construct_training_test(d_car)
training_set <- training_test_sets[["training"]]
training_set
# A tibble: 156 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 quattro 1.8 1999 4 manu~ 4 18 26 p comp~
2 audi a4 2 2008 4 auto~ f 21 30 p comp~
3 honda civic 1.6 1999 4 auto~ f 24 32 r subc~
4 toyota camry sol~ 2.2 1999 4 auto~ f 21 27 r comp~
5 toyota camry sol~ 2.4 2008 4 manu~ f 21 31 r comp~
6 audi a4 1.8 1999 4 auto~ f 18 29 p comp~
7 audi a6 quattro 3.1 2008 6 auto~ 4 17 25 p mids~
8 lincoln navigator~ 5.4 1999 8 auto~ r 11 16 p suv
9 land rover range rov~ 4.4 2008 8 auto~ 4 12 18 r suv
10 toyota land crui~ 5.7 2008 8 auto~ 4 13 18 r suv
# ... with 146 more rows
test_set <- training_test_sets[["test"]]
test_set
# A tibble: 78 x 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 volkswagen jetta 2.8 1999 6 manu~ f 17 24 r comp~
2 ford explorer ~ 4 1999 6 manu~ 4 15 19 r suv
3 nissan altima 2.5 2008 4 auto~ f 23 31 r mids~
4 chevrolet malibu 3.1 1999 6 auto~ f 18 26 r mids~
5 chevrolet malibu 2.4 2008 4 auto~ f 22 30 r mids~
6 dodge caravan 2~ 3.3 2008 6 auto~ f 11 17 e mini~
7 volkswagen jetta 2 2008 4 manu~ f 21 29 p comp~
8 volkswagen gti 2 2008 4 auto~ f 22 29 p comp~
9 toyota corolla 1.8 2008 4 manu~ f 28 37 r comp~
10 dodge caravan 2~ 3.3 2008 6 auto~ f 17 24 r mini~
# ... with 68 more rows
We now fit our regression model to the training set, as we have done many times.
To evaluate the prediction accuracy, we start by computing the predictions for the test set.
A good model will make predictions that are close to the true values of the response variable.
A common measure of prediction accuracy is the root mean squared error (RMSE):
\[ RMSE = \sqrt{\dfrac{SSE}{m}} = \sqrt{\dfrac{(y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_m - \hat{y}_m)^2}{m}} \]
where \(m\) corresponds to the size of the test set.
Apply steps 2 and 3 on different models.
Simply choose the model with the lowest RMSE.
This model has the best out-of-sample accuracy.
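Putting the steps together for the car data, a sketch (the model hwy ~ displ is just an illustrative choice, and the split mirrors construct_training_test() above):

```r
# A sketch of the holdout method for the car data.
library(dplyr)

# Step 1: random 2/3 training - 1/3 test split
d_car <- ggplot2::mpg
d_shuffled <- slice_sample(d_car, n = nrow(d_car))
training_set <- slice(d_shuffled, 1:156)      # round(234 * 2/3) = 156
test_set     <- slice(d_shuffled, -(1:156))

# Step 2: fit the model on the training set only
m <- lm(hwy ~ displ, data = training_set)

# Step 3: predict the test set and compute the RMSE
pred <- predict(m, newdata = test_set)
sqrt(mean((test_set$hwy - pred)^2))
```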
Cross-validation (CV) is the same as the holdout method, but repeated many times.
A drawback of the holdout method is that the results depend heavily on which observations end up in the test set.
Repeating steps 2 and 3 with a different partition from step 1 may give different results.
Idea of CV: let each observation be in the test set exactly once.
Choose a number of folds \(k\) (typically \(5\) or \(10\)).
set.seed(345)
n_folds <- 10
county_2019_nc_folds <- county_2019_nc %>%
slice_sample(n = nrow(county_2019_nc)) %>%
mutate(fold = rep(1:n_folds, n_folds)) %>%
arrange(fold)
predict_folds <- function(i) {
fit <- lm(uninsured ~ hs_grad, data = county_2019_nc_folds %>% filter(fold != i))
predict(fit, newdata = county_2019_nc_folds %>% filter(fold == i)) %>%
bind_cols(county_2019_nc_folds %>% filter(fold == i), .fitted = .)
}
nc_fits <- map_df(1:n_folds, predict_folds)
p_nc_fits <- ggplot(nc_fits, aes(x = hs_grad, y = .fitted, group = fold)) +
geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
labs(
x = "High school graduate", y = "Uninsured",
title = "Predicted uninsurance rate in NC",
subtitle = glue("For {n_folds} different testing datasets")
)
p_nc_fits
set.seed(123)
n_folds <- 10
county_2019_ny_folds <- county_2019_ny %>%
slice_sample(n = nrow(county_2019_ny)) %>%
mutate(fold = c(rep(1:n_folds, 6), 1, 2)) %>%
arrange(fold)
predict_folds <- function(i) {
fit <- lm(uninsured ~ hs_grad, data = county_2019_ny_folds %>% filter(fold != i))
predict(fit, newdata = county_2019_ny_folds %>% filter(fold == i)) %>%
bind_cols(county_2019_ny_folds %>% filter(fold == i), .fitted = .)
}
ny_fits <- map_df(1:n_folds, predict_folds)
p_ny_fits <- ggplot(ny_fits, aes(x = hs_grad, y = .fitted, group = fold)) +
geom_line(stat = "smooth", method = "lm", se = FALSE, size = 0.3, alpha = 0.5) +
scale_x_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
scale_y_continuous(labels = label_percent(scale = 1, accuracy = 1)) +
labs(
x = "High school graduate", y = "Uninsured",
title = "Predicted uninsurance rate in NY",
subtitle = glue("For {n_folds} different testing datasets")
)
p_ny_fits
Set \(k=n\); that is, we use \(n\) folds, each of size \(1\). The test sets will therefore consist of a single observation and the training sets of \(n-1\) observations.
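For the trees data, this leave-one-out scheme can be written directly. A sketch (d_tree is assumed to be the built-in trees data frame):

```r
# Leave-one-out cross-validation (k = n) for the simple trees model.
d_tree <- datasets::trees
errors <- sapply(seq_len(nrow(d_tree)), function(i) {
  fit  <- lm(Volume ~ Girth, data = d_tree[-i, ])  # train on the other n - 1 trees
  pred <- predict(fit, newdata = d_tree[i, ])      # predict the held-out tree
  d_tree$Volume[i] - pred                          # prediction error
})
sqrt(mean(errors^2))   # LOOCV RMSE
```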
Not my favorite method, but something you should learn because it is widely used.
Stepwise selection procedures are of two types: (i) forward and (ii) backward.
Start with the empty model \(Y \approx \beta_0\), i.e. the model with no predictor. This is our current model.
Fit all possible models with one additional predictor.
If the AIC of the candidate model is smaller (better) than the AIC of the current model, the candidate model becomes the current model, and we go back to step 3.
If the AIC of the candidate model is larger than the AIC of the current model (no new model improves on the current one), the procedure stops, and we select the current model.
Similar to forward selection, except that we start with the full model (all predictors) and remove one predictor at a time.
Note that forward and backward selection need not agree; they may select different models.
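Base R's step() function automates both procedures using the AIC. A sketch for the car data (the candidate predictors are an illustrative choice):

```r
# Stepwise selection with base R's step(); trace = 0 silences the log.
d_car <- ggplot2::mpg

# Forward selection: start from the empty model, add predictors one at a time
m_empty <- lm(hwy ~ 1, data = d_car)
m_forward <- step(m_empty,
                  scope = ~ displ + cyl + drv + class,
                  direction = "forward", trace = 0)

# Backward selection: start from the full model, remove predictors one at a time
m_full <- lm(hwy ~ displ + cyl + drv + class, data = d_car)
m_backward <- step(m_full, direction = "backward", trace = 0)
```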
Use domain knowledge.
One-in-ten rule (some use one-in-five): aim for at least ten observations per predictor.